content encoder
DarkStream: Real-Time Speech Anonymization with Low Latency
Quamer, Waris, Gutierrez-Osuna, Ricardo
Abstract--We propose DarkStream, a streaming speech synthesis model for real-time speaker anonymization. To improve content encoding under strict latency constraints, DarkStream combines a causal waveform encoder, a short lookahead buffer, and transformer-based contextual layers. To further reduce inference time, the model generates waveforms directly via a neural vocoder, thus removing intermediate mel-spectrogram conversions. Evaluations show our model achieves strong anonymization, yielding close to 50% speaker verification EER (near-chance performance) in the lazy-informed attack scenario, while maintaining acceptable linguistic intelligibility (WER within 9%). By balancing low latency, robust privacy, and minimal intelligibility degradation, DarkStream provides a practical solution for privacy-preserving real-time speech communication.

Voice recordings contain rich biometric information that reveals not only linguistic content but also personal attributes such as speaker identity, sex, and age, as well as paralinguistics (dialect/accent, emotions). Such sensitive information can be exploited by adversaries for speaker recognition and profiling, raising significant privacy concerns.
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)
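The latency-control mechanism in the DarkStream abstract above (a causal encoder fed chunk by chunk, with a short lookahead) can be illustrated with a minimal sketch. This is not the authors' code; `CausalEncoder`, the chunk size, and the lookahead length are illustrative placeholders.

```python
# Minimal sketch (not the authors' code): chunked streaming through a causal
# encoder with a short lookahead buffer. Sizes are illustrative placeholders.
import torch
import torch.nn as nn

class CausalEncoder(nn.Module):
    """Causal 1-D conv stack: left-padding ensures no future samples are used."""
    def __init__(self, dim=64, kernel=5):
        super().__init__()
        self.pad = kernel - 1
        self.conv = nn.Conv1d(1, dim, kernel)

    def forward(self, x):                          # x: (batch, 1, samples)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

def stream(wave, encoder, chunk=320, lookahead=80):
    """Encode `wave` chunk by chunk; each chunk peeks `lookahead` samples
    ahead, so algorithmic latency is bounded by chunk + lookahead."""
    outputs = []
    for start in range(0, wave.size(-1) - lookahead, chunk):
        end = min(start + chunk + lookahead, wave.size(-1))
        feats = encoder(wave[..., start:end])
        outputs.append(feats[..., :chunk])         # emit only the settled frames
    return torch.cat(outputs, dim=-1)

out = stream(torch.randn(1, 1, 16000), CausalEncoder())
```

For clarity each chunk is encoded independently here; a real streaming implementation would carry convolution state across chunks, and DarkStream's transformer contextual layers and neural vocoder are omitted entirely.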
FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation
Kaneko, Takuhiro, Kameoka, Hirokazu, Tanaka, Kou, Kondo, Yuto
A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves VC performance competitive with FastVoiceGrad while running 6.6-6.9 times faster on a GPU and 1.8 times faster on a CPU.
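A rough, heavily simplified sketch of the distillation recipe named in the abstract: a one-step student matches samples from the slow iterative teacher while a discriminator enforces realism. All module names are placeholders, and the exact ADCD formulation (score distillation, joint content-encoder distillation) is not reproduced.

```python
# Rough sketch (placeholders, not the paper's ADCD): one-step distillation of
# an iterative diffusion VC teacher with an added adversarial term.
import torch
import torch.nn.functional as F

def distill_step(student, disc, teacher_sample, mel_src, spk_tgt, opt_g, opt_d):
    with torch.no_grad():
        target = teacher_sample(mel_src, spk_tgt)   # slow iterative sampler

    noise = torch.randn_like(mel_src)
    fake = student(noise, mel_src, spk_tgt)         # single denoising step

    # discriminator update (hinge loss)
    d_loss = (F.relu(1 - disc(target)).mean()
              + F.relu(1 + disc(fake.detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # student update: match the teacher, plus adversarial realism
    g_loss = F.l1_loss(fake, target) - disc(fake).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return g_loss.item(), d_loss.item()
```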
Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement
Nguyen, Tuan-Nam, Pham, Ngoc-Quan, Akti, Seymanur, Waibel, Alexander
We propose the first streaming accent conversion (AC) model, which transforms non-native speech into a native-like accent while preserving speaker identity and prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves comparable performance to the top AC models while maintaining stable latency, making it the first AC system capable of streaming.
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
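The Emformer encoder this abstract builds on is available in torchaudio, so the chunkwise streaming pattern can be sketched directly. All dimensions and the segment/context lengths below are illustrative, not the paper's configuration.

```python
# Sketch of chunkwise streaming with torchaudio's Emformer; sizes are
# illustrative, not the paper's configuration.
import torch
from torchaudio.models import Emformer

enc = Emformer(input_dim=80, num_heads=4, ffn_dim=512, num_layers=4,
               segment_length=16, left_context_length=8, right_context_length=4)
enc.eval()

feats = torch.randn(1, 200, 80)                     # incoming log-mel frames
seg, right = 16, 4
states, stream_out = None, []
with torch.no_grad():
    for t in range(0, feats.size(1), seg):
        chunk = feats[:, t : t + seg + right]       # segment plus lookahead
        if chunk.size(1) < seg + right:             # skip the ragged tail
            break
        lengths = torch.tensor([chunk.size(1)])
        out, _, states = enc.infer(chunk, lengths, states)
        stream_out.append(out)                      # one segment per step
```

The `right_context_length` frames are the lookahead that bounds latency; `states` carries the left context between calls so each step only sees one new segment.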
Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning
Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that rely on parallel data, our approach leverages deep learning techniques to enhance disentanglement completion and linguistic content preservation.

VAEs consist of two main parts: a content encoder and a decoder. The content encoder processes source speech, transforms it into a latent representation, and removes speaker information. The decoder takes the speaker identity, combines it with the latent representation, and reconstructs the speech [5]. A notable VAE approach is disentangling speaker and content representations using instance normalization.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
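The instance-normalization trick mentioned at the end of the abstract can be sketched in a few lines: instance norm over time removes per-utterance channel statistics, which carry much of the speaker identity, and the decoder re-injects a speaker embedding. The architecture below is an illustrative assumption (in the spirit of AdaIN-VC-style disentanglement), not the Stepback network itself.

```python
# Illustrative sketch of instance-norm disentanglement, not the Stepback network.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, dim, kernel_size=5, padding=2)
        self.inorm = nn.InstanceNorm1d(dim)   # strips per-utterance channel stats

    def forward(self, mel):                   # mel: (batch, n_mels, frames)
        return self.inorm(torch.relu(self.conv(mel)))

class Decoder(nn.Module):
    def __init__(self, dim=128, spk_dim=64, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(spk_dim, dim)   # re-inject speaker identity
        self.out = nn.Conv1d(dim, n_mels, kernel_size=5, padding=2)

    def forward(self, content, spk):          # spk: (batch, spk_dim)
        return self.out(content + self.proj(spk).unsqueeze(-1))

recon = Decoder()(ContentEncoder()(torch.randn(2, 80, 100)), torch.randn(2, 64))
```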
JOOCI: a Framework for Learning Comprehensive Speech Representations
Yadav, Hemant, Shah, Rajiv Ratn, Sitaram, Sunayana
Information in speech can be divided into two categories: "what is being said" (content) and "how it is expressed" (other). Current state-of-the-art (SOTA) techniques model speech at fixed segments, usually 10-25 ms, using a single embedding. Given the orthogonal nature of the "other" and "content" information, attempting to optimize both within a single embedding results in suboptimal solutions. This approach divides the model's capacity, limiting its ability to build complex hierarchical features effectively. In this work, we present an end-to-end speech representation learning framework designed to jointly optimize the "other" and "content" information (JOOCI) in speech. Our results show that JOOCI consistently outperforms other SOTA models of similar size (100 million parameters) and pre-training data (960 hours) by a significant margin when evaluated on a range of speech downstream tasks in the SUPERB benchmark, as shown in Table 1. Code and models are available at TBA.

Self-supervised learning (SSL) has played a significant role in learning high-level representations of text (Brown et al., 2020), vision (Alexey, 2020), and audio (Baevski et al., 2020; Mohamed et al., 2022; Défossez et al., 2022) data. In this work, we focus on learning high-level representations from raw speech.
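The core JOOCI premise, that "content" and "other" information deserve separate capacity rather than one shared embedding, can be sketched as two parallel encoders with task-specific heads. All sizes, targets, and heads below are assumptions for illustration, not the paper's architecture.

```python
# Illustrative sketch of separate capacity for "content" and "other" streams.
import torch
import torch.nn as nn

class JointModel(nn.Module):
    def __init__(self, feat_dim=512, dim=256, n_phones=40, n_speakers=251):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, dim, batch_first=True)  # "what is said"
        self.other_enc = nn.GRU(feat_dim, dim, batch_first=True)    # "how it is said"
        self.phone_head = nn.Linear(dim, n_phones)
        self.spk_head = nn.Linear(dim, n_speakers)

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        content, _ = self.content_enc(feats)
        other, _ = self.other_enc(feats)
        # frame-level content predictions; utterance-level "other" prediction
        return self.phone_head(content), self.spk_head(other.mean(dim=1))

phone_logits, spk_logits = JointModel()(torch.randn(2, 100, 512))
```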
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization
Wang, Yuejiao, Wu, Xixin, Wang, Disong, Meng, Lingwei, Meng, Helen
Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech. The technology eases communication with speakers affected by this neuromotor disorder and enhances their social inclusion. NED-based (Neural Encoder-Decoder) systems have significantly improved the intelligibility of the reconstructed speech as compared with GAN-based (Generative Adversarial Network) approaches, but the approach is still limited by the training inefficiency caused by its cascaded pipeline and the auxiliary tasks of the content encoder, which may in turn affect the quality of reconstruction. Inspired by self-supervised speech representation learning and discrete speech units, we propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement and utilizes speech units to constrain the dysarthric content restoration in a discrete linguistic space. Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks. Results on the UASpeech corpus indicate that Unit-DSR outperforms competitive baselines in terms of content restoration, reaching a 28.2% relative average word error rate reduction when compared to original dysarthric speech, and shows robustness against speed perturbation and noise.
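The discrete speech units that Unit-DSR builds on can be sketched with off-the-shelf components: HuBERT features quantized by k-means. The layer index and codebook size below are common choices, not necessarily the paper's, and the unit normalizer and Unit HiFi-GAN vocoder are omitted.

```python
# Sketch of discrete speech units: HuBERT features quantized with k-means.
import torch
import torchaudio
from sklearn.cluster import KMeans

model = torchaudio.pipelines.HUBERT_BASE.get_model().eval()

wave = torch.randn(1, 48000)                  # 3 s of placeholder audio at 16 kHz
with torch.no_grad():
    feats, _ = model.extract_features(wave)   # list of per-layer features
layer6 = feats[5].squeeze(0)                  # (frames, 768)

# in practice k-means is fit on a large corpus; fitting on one utterance
# here only keeps the example self-contained
km = KMeans(n_clusters=100, n_init=10).fit(layer6.numpy())
units = km.labels_                            # the discrete unit sequence
```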
StreamVC: Real-Time Low-Latency Voice Conversion
Yang, Yang, Kartynnik, Yury, Li, Yunpeng, Tang, Jiuqiang, Li, Xing, Sung, George, Grundmann, Matthias
We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency from the input signal even on a mobile platform, making it applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios. Our design leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight high-quality speech synthesis. We demonstrate the feasibility of learning soft speech units causally, as well as the effectiveness of supplying whitened fundamental frequency information to improve pitch stability without leaking the source timbre information.
- North America > United States (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
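The "whitened fundamental frequency" idea in the StreamVC abstract above admits a short sketch: normalize the log-F0 track so the model receives the pitch contour but not the speaker's absolute pitch level. The F0 extractor and the per-utterance statistics below are stand-ins; a streaming system would use running (causal) statistics instead.

```python
# Sketch of "whitened F0": pass the pitch contour, not the absolute level.
import torch
import torchaudio.functional as AF

def whitened_f0(wave, sample_rate=16000):
    f0 = AF.detect_pitch_frequency(wave, sample_rate)      # coarse F0 track
    logf0 = torch.log(f0.clamp(min=1e-5))
    # zero mean / unit variance: the contour survives, the level does not
    return (logf0 - logf0.mean()) / logf0.std().clamp(min=1e-5)

contour = whitened_f0(torch.randn(1, 16000))
```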
SAIC: Integration of Speech Anonymization and Identity Classification
Cheng, Ming, Diao, Xingjian, Cheng, Shitong, Liu, Wenjun
Speech anonymization and de-identification have garnered significant attention recently, especially in the healthcare area, including telehealth consultations, patient voiceprint matching, and patient real-time monitoring. Speaker identity classification tasks, which involve recognizing specific speakers from audio to learn identity features, are crucial for de-identification. Since few studies have effectively combined speech anonymization with identity classification, we propose SAIC, an innovative pipeline for integrating Speech Anonymization and Identity Classification. SAIC demonstrates remarkable performance, reaching state of the art on the speaker identity classification task on the VoxCeleb1 dataset with a top-1 accuracy of 96.1%. Although SAIC is not trained or evaluated specifically on clinical data, the results strongly support the model's effectiveness and its potential to generalize to the healthcare domain, providing insightful guidance for future work.
AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion
Choi, Haeyun, Gim, Jio, Lee, Yuho, Kim, Youngin, Suh, Young-Joo
This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing. Previous works suffer from information loss and poor synthesis quality due to their reliance on a carefully designed bottleneck structure. Moreover, models relying solely on self-reconstruction loss struggle to reproduce different speakers' voices. To address these issues, we propose a cycle-consistency loss that considers conversion back and forth between target and source speakers. Additionally, stacked random-shuffled mel-spectrograms and a label smoothing method are utilized during speaker encoder training to extract a time-independent global speaker representation from speech, which is the key to zero-shot conversion. Our model outperforms existing state-of-the-art models in both subjective and objective evaluations. Furthermore, it facilitates cross-lingual voice conversions and enhances the quality of synthesized speech.
- Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)
- Asia > India (0.04)
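The cycle-consistency loss described in the AutoCycle-VC abstract above admits a compact sketch: convert source speech to the target speaker, convert it back, and penalize the round-trip error. `convert` is a placeholder for the full conversion model.

```python
# Compact sketch of a cycle-consistency loss for voice conversion.
import torch.nn.functional as F

def cycle_loss(convert, mel_src, spk_src, spk_tgt):
    converted = convert(mel_src, spk_tgt)   # source content, target voice
    cycled = convert(converted, spk_src)    # and back to the source speaker
    return F.l1_loss(cycled, mel_src)       # the round trip should be lossless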
Emotion Embeddings: Learning Stable and Homogeneous Abstractions from Heterogeneous Affective Datasets
Human emotion is expressed in many communication modalities and media formats, and so its computational study is equally diversified into natural language processing, audio signal analysis, computer vision, etc. Similarly, the large variety of representation formats used in previous research to describe emotions (polarity scales, basic emotion categories, dimensional approaches, appraisal theory, etc.) has led to an ever-proliferating diversity of datasets, predictive models, and software tools for emotion analysis. Because of these two distinct types of heterogeneity, at the expressional and representational level, there is a dire need to unify previous work on increasingly diverging data and label types. This article presents such a unifying computational model. We propose a training procedure that learns a shared latent representation for emotions, so-called emotion embeddings, independent of different natural languages, communication modalities, media or representation label formats, and even disparate model architectures. Experiments on a wide range of heterogeneous affective datasets indicate that this approach yields the desired interoperability for the sake of reusability, interpretability and flexibility, without penalizing prediction quality. Code and data are archived under https://doi.org/10.5281/zenodo.7405327.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
- (33 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Health & Medicine (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)
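The unification idea in the emotion-embeddings abstract above, one shared embedding space behind modality-specific encoders and label-format-specific heads, can be sketched as follows. All dimensions and head types are illustrative assumptions, not the paper's setup.

```python
# Illustrative sketch: heterogeneous encoders and label heads routed through
# one shared emotion-embedding space.
import torch
import torch.nn as nn

shared_dim = 128
text_enc = nn.Linear(768, shared_dim)      # e.g., on top of a text encoder
audio_enc = nn.Linear(1024, shared_dim)    # e.g., on top of a speech encoder

heads = nn.ModuleDict({
    "polarity": nn.Linear(shared_dim, 1),  # scalar polarity scale
    "basic6": nn.Linear(shared_dim, 6),    # basic emotion categories
    "vad": nn.Linear(shared_dim, 3),       # valence / arousal / dominance
})

emb = text_enc(torch.randn(4, 768))        # shared emotion embedding
vad_pred = heads["vad"](emb)               # any head can decode any modality
```

Because every encoder targets the same space, a head trained on one dataset's label format can decode embeddings produced from another dataset or modality, which is the interoperability the abstract claims.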